Abstract


In practical applications of item response theory (IRT), item parameters are usually estimated first 
from a calibration sample. After treating these estimates as fixed and known, ability parameters 
are then estimated. However, the statistical inferences based on the estimated abilities can be 
misleading if the uncertainty of the item parameter estimates is ignored. Instead, estimated 
item parameters can be regarded as covariates measured with error. Along the line of this 
measurement-error-model approach, asymptotic expansions of the maximum likelihood estimator 
(MLE) and weighted likelihood estimator (WLE) of ability were derived by Zhang, Xie, Song, and 
Lu (2007). In this paper, we propose an estimator of an ability parameter based on the asymptotic 
formula of the WLE. A simulation study shows that the new estimator effectively reduces the bias
of the MLE or WLE of ability that arises when the uncertainty of the item parameter estimates is
not taken into account.


Key words: Bias reduction, item response theory (IRT), maximum likelihood estimator (MLE), 
measurement error, weighted likelihood estimator (WLE). 
1 Introduction


In practical applications of item response theory (IRT), item parameters are usually estimated 
first from a calibration sample. After treating these estimates as fixed and known, ability 
parameters are then estimated and further statistical inferences are made. When item parameter 
estimation is sufficiently accurate, it may not be problematic to substitute the estimated item 
parameters for the true ones in the IRT models when estimating ability parameters. However, 
when the measurement errors in item parameter estimates are no longer ignorable, the statistical 
inferences based on such a substitution could be misleading. For instance, Tsutakawa and 
Johnson (1990) demonstrated that both the maximum likelihood and empirical Bayes approaches 
underestimate the variance of ability when the uncertainty of item parameter estimates is ignored. 

Given item parameters, Lord (1983, 1986) and Samejima (1993a, 1993b) used Taylor’s 
expansion of the likelihood equation to obtain an approximation for the bias and its standard 
error formulae for the maximum likelihood estimator (MLE) of ability in the context of different 
IRT models. Based on Lord’s bias function, Warm (1989) used the weighted likelihood estimation 
method to estimate ability parameters and showed that the weighted likelihood estimator (WLE) 
is less biased than the MLE with the same asymptotic variance and normal distribution. Assuming 
item parameters are known, the WLE method is effective in reducing bias. 

However, when item parameters are unknown and estimated item parameters are used as 
substitutes for the true ones in likelihood functions, as would be the case in all applications, 
the WLE method for ability estimation is not as effective for the 3PL case (Zhang, 2005). As a 
result, the measurement error in item parameter estimation must be considered as a potential 
contaminator of the ability estimation as well. The bias of the MLE of ability based on fixed 
estimated item parameters comes from two sources: (a) the bias of the MLE of ability given true 
item parameters, and (b) the measurement error from the uncertainty of the item parameters. 
Lord (1983, 1986), Warm (1989), and Samejima (1993a, 1993b) only investigated the first of these 
sources. Various approaches have also been proposed to address the measurement error resulting 
from the uncertainty of item parameters (Lewis, 1985, 2001; Mislevy, Wingersky, & Sheehan, 
1994; Song, 2003; Tsutakawa & Johnson, 1990; Zhang, Xie, Song, & Lu, 2007). One of these 
approaches, the measurement-error-model approach, treats estimated item parameters as
covariates measured with errors, instead of treating them as being fixed in nature (Song, 2003;
Zhang et al., 2007). Thus, a bias-correction formula can be developed along the line of what has
been done in research on measurement error models (Stefanski & Carroll, 1985). In this paper, 
we propose a bias-corrected estimator of an ability parameter based on the asymptotic expansion 
formula of the WLE of ability. A simulation study is conducted to compare the new method with 
the MLE and WLE methods in terms of the bias and the root mean squared errors (RMSE) of 
estimated abilities. The result shows that the new estimator effectively reduces the bias in the 
cases considered in the simulation study. 


2 The Effect of Uncertainty About Item Parameters on Ability Estimation 

Suppose a test consists of $n$ dichotomous items. Let $y = (y_1, \ldots, y_n)$ be the response
vector of an examinee with $y_i = 1$ (correct) or $y_i = 0$ (incorrect) for $i = 1, \ldots, n$. The item
response function (IRF) of a 3PL model is

$$P_i(\theta) = P(\theta; a_i, b_i, c_i) = P(y_i = 1 \mid \theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\{-1.7 a_i(\theta - b_i)\}}, \tag{1}$$

where $a_i$, $b_i$, and $c_i$ are the item discrimination, difficulty, and guessing parameters, respectively.
Let

$$P_i^*(\theta) = \frac{1}{1 + \exp\{-1.7 a_i(\theta - b_i)\}} \tag{2}$$

denote a 2PL model. Thus, $P_i(\theta) = c_i + (1 - c_i) P_i^*(\theta)$. The 3PL model is often rewritten as

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\{-1.7(a_i \theta + d_i)\}}, \tag{3}$$

where $d_i = -a_i b_i$ is the intercept parameter.
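
The three response functions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`p_star`, `p_3pl`, `p_3pl_intercept`) are hypothetical and not part of the original report, and the example item is taken from Table 1.

```python
import math

D = 1.7  # scaling constant that appears in Equations (1)-(3)


def p_star(theta, a, b):
    """2PL response function P*_i(theta) of Equation (2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def p_3pl(theta, a, b, c):
    """3PL response function P_i(theta) of Equation (1)."""
    return c + (1.0 - c) * p_star(theta, a, b)


def p_3pl_intercept(theta, a, d, c):
    """Slope-intercept form of Equation (3), where d = -a * b."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * (a * theta + d)))


# Item 27 of Table 1: a = 1.506, b = -0.495, c = 0.215.
a, b, c = 1.506, -0.495, 0.215
# At theta = b the 2PL part equals 1/2, so P_i(b) = c + (1 - c) / 2.
print(p_3pl(b, a, b, c))
# Both parameterizations give the same probability at any theta.
print(p_3pl(0.3, a, b, c), p_3pl_intercept(0.3, a, -a * b, c))
```

The two parameterizations differ only in how the logit is written, which is why results reported on the $(a_i, d_i, c_i)$ scale can be mapped back to $(a_i, b_i, c_i)$.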

The MLE of an examinee's ability is commonly used in practice (Birnbaum, 1968; Wang
& Vispoel, 1998; Yi, Wang, & Ban, 2001). Under the assumptions of local or conditional
independence (Lord, 1980), the likelihood function for the response vector $y$ is

$$L(y \mid \theta) = \prod_{i=1}^{n} P_i^{y_i}(\theta)\, Q_i^{1 - y_i}(\theta), \tag{4}$$

where $Q_i(\theta) = 1 - P_i(\theta)$. If item parameters $(a_i, b_i, c_i)$ in these models are known, the MLE
$\hat{\theta}_m$ of ability is defined as the value of $\theta$ that maximizes (4). In practice, $\hat{\theta}_m$ is often found by setting
the derivative of the log-likelihood function to zero; that is, $\hat{\theta}_m$ satisfies

$$\frac{\partial \ln L(y \mid \theta)}{\partial \theta} = \sum_{i=1}^{n} \left( \frac{y_i - P_i(\theta)}{P_i(\theta) Q_i(\theta)} \right) P_i'(\theta) = 0, \tag{5}$$

where $P_i'(\theta)$ is the first derivative of $P_i(\theta)$ with respect to $\theta$ (see Lord, 1980). Since
$P_i'(\theta) = 1.7 a_i P_i^*(\theta) Q_i(\theta)$, the likelihood equation (5) becomes

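$$\sum_{i=1}^{n} a_i K_i(\theta)\,(y_i - P_i(\theta)) = 0, \tag{6}$$

where

$$K_i(\theta) = K(\theta; a_i, b_i, c_i) = \frac{P_i^*(\theta)}{P_i(\theta)} = \frac{1}{1 + c_i \exp\{-1.7 a_i(\theta - b_i)\}}.$$

Let

$$I(\theta) = \sum_{i=1}^{n} \frac{[P_i'(\theta)]^2}{P_i(\theta) Q_i(\theta)}$$

be the Fisher test information function. The asymptotic variance of the MLE of $\theta$ is $\mathrm{Var}(\hat{\theta}) = 1/I(\theta)$. After
some calculations,

$$I(\theta) = 1.7^2 \sum_{i=1}^{n} a_i^2 (1 - c_i) P_i^*(\theta) Q_i^*(\theta) K_i(\theta), \tag{7}$$

where $Q_i^*(\theta) = 1 - P_i^*(\theta)$.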
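
As a numerical check of the algebra above, the following sketch evaluates the left-hand side of Equation (6) and compares the closed form (7) with the defining sum of $[P_i'(\theta)]^2 / (P_i(\theta) Q_i(\theta))$. The function names and the three-item example are hypothetical; the item values are taken from Table 1.

```python
import math

D = 1.7


def item_terms(theta, a, b, c):
    """Return P_i, P*_i, and K_i = P*_i / P_i for one item."""
    ps = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * ps
    return p, ps, ps / p


def score(theta, items, y):
    """Left-hand side of Equation (6): sum_i a_i K_i (y_i - P_i)."""
    total = 0.0
    for (a, b, c), yi in zip(items, y):
        p, _, k = item_terms(theta, a, b, c)
        total += a * k * (yi - p)
    return total


def fisher_information(theta, items):
    """Fisher test information in the closed form of Equation (7)."""
    total = 0.0
    for a, b, c in items:
        _, ps, k = item_terms(theta, a, b, c)
        total += D ** 2 * a ** 2 * (1.0 - c) * ps * (1.0 - ps) * k
    return total


def information_from_definition(theta, items):
    """Sum of [P_i']^2 / (P_i Q_i), using P_i' = 1.7 a_i P*_i Q_i."""
    total = 0.0
    for a, b, c in items:
        p, ps, _ = item_terms(theta, a, b, c)
        dp = D * a * ps * (1.0 - p)
        total += dp ** 2 / (p * (1.0 - p))
    return total


# Items 1, 27, and 41 of Table 1.
items = [(0.623, -0.872, 0.0), (1.506, -0.495, 0.215), (2.303, 0.609, 0.418)]
for theta in (-2.0, 0.0, 1.5):
    print(theta, fisher_information(theta, items),
          information_from_definition(theta, items))
```

The two information values agree up to rounding error, which is the content of the step from the definition of $I(\theta)$ to Equation (7).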

The likelihood function is strictly increasing or decreasing for an all-correct-response pattern
(i.e., a perfect score) or an all-incorrect-response pattern (i.e., a zero score). Thus, the MLE of
ability corresponding to a perfect score or a zero score is $+\infty$ or $-\infty$. Bayes estimators of ability
corresponding to perfect scores and zero scores can be finite if an informative prior distribution
of ability is appropriately used. This is a major reason why a Bayesian method is sometimes
preferred. In practice, examinees with perfect scores or zero scores are usually assigned the highest
or lowest scores, such as an 800 or a 200 in SAT® subject tests. A Bayesian method basically also
gives a fixed value for each of the two extreme cases given a fixed prior distribution. In effect, any
value can be a reasonable estimate of abilities of examinees with perfect scores as long as the value
is at least as large as the ability estimates of all other examinees, regardless of estimation methods.
Similarly, a reasonable estimate of ability with a zero score should be no larger than the ability
estimates of all other examinees. Therefore, the shortcoming of the MLE of ability corresponding
to perfect scores and zero scores can be easily overcome by constraining the range of ability on a
closed, but large enough, interval, say $[-4, 4]$, so that the MLE of ability for a perfect score
or a zero score is the upper or lower endpoint of the interval (see Lord, 1983; Zhang, 2005). Note
that this paper will not further consider the bias of the ability estimates in these extreme cases.
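
A minimal sketch of such a bounded MLE is given below. It is not the procedure used in the report: the grid search followed by a ternary-search refinement is simply one convenient way to maximize the log-likelihood on $[-4, 4]$, and all names (`bounded_mle`, `log_likelihood`, the grid size) are assumptions made for illustration.

```python
import math

D = 1.7
LOWER, UPPER = -4.0, 4.0  # bounded ability interval mentioned in the text


def prob(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, items, y):
    """Logarithm of the likelihood for one response vector."""
    ll = 0.0
    for (a, b, c), yi in zip(items, y):
        p = prob(theta, a, b, c)
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return ll


def bounded_mle(items, y, grid_points=161, refinements=60):
    """Maximize the log-likelihood over [LOWER, UPPER].

    A coarse grid locates the best cell; a ternary search refines it.
    Perfect and zero scores end up at the interval endpoints because
    the likelihood is monotone in those two cases.
    """
    step = (UPPER - LOWER) / (grid_points - 1)
    grid = [LOWER + j * step for j in range(grid_points)]
    best = max(grid, key=lambda t: log_likelihood(t, items, y))
    lo, hi = max(LOWER, best - step), min(UPPER, best + step)
    for _ in range(refinements):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, items, y) < log_likelihood(m2, items, y):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# Items 1, 27, 41, and 2 of Table 1.
items = [(0.623, -0.872, 0.0), (1.506, -0.495, 0.215),
         (2.303, 0.609, 0.418), (0.920, 1.008, 0.0)]
print(bounded_mle(items, [1, 1, 0, 0]))  # interior estimate
print(bounded_mle(items, [1, 1, 1, 1]))  # perfect score: upper endpoint
print(bounded_mle(items, [0, 0, 0, 0]))  # zero score: lower endpoint
```

The grid step guards against settling on a local maximum, which can occur for 3PL likelihoods; a Newton iteration on Equation (6) would be faster but needs a good starting value.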
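Given item parameters, Lord (1983) obtained the following bias function for the MLE of $\theta$:

$$B(\theta) = \frac{1.7}{I^2(\theta)} \sum_{i=1}^{n} a_i I_i(\theta) \left[ P_i^*(\theta) - \frac{1}{2} \right], \tag{8}$$

where $I_i(\theta)$ is the item information function of item $i$, that is,

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta) Q_i(\theta)} = 1.7^2 a_i^2 (1 - c_i) P_i^*(\theta) Q_i^*(\theta) K_i(\theta).$$

The MLE with Lord bias-correction (MLE-LBC) of $\theta$ is defined as

$$\hat{\theta}_c = \hat{\theta}_m - B(\hat{\theta}_m).$$

The bias of $\hat{\theta}_c$, $\mathrm{BIAS}(\hat{\theta}_c)$, is $o(n^{-1})$ (i.e., $\lim_{n \to \infty} n\,\mathrm{BIAS}(\hat{\theta}_c) = 0$) while $\mathrm{BIAS}(\hat{\theta}_m)$ is $O(n^{-1})$
(i.e., $n\,\mathrm{BIAS}(\hat{\theta}_m)$ is bounded for all $n$) under the assumption that the true values of the item
parameters are known.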

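Based on Lord's work, Warm (1989) proposed the weighted likelihood estimation method and
showed that the WLE of ability is less biased than the MLE with the same asymptotic variance
and normal distribution. The WLE $\hat{\theta}_w$ is defined as the value of $\theta$ that maximizes

$$f(\theta) L(y \mid \theta) = f(\theta) \prod_{i=1}^{n} P_i^{y_i}(\theta)\, Q_i^{1 - y_i}(\theta),$$

where $f(\theta)$ is a suitably chosen function satisfying

$$\frac{\partial \ln f(\theta)}{\partial \theta} = -B(\theta) I(\theta).$$

Therefore, $\hat{\theta}_w$ satisfies the following weighted likelihood equation,

$$\frac{\partial \ln L(y \mid \theta)}{\partial \theta} - B(\theta) I(\theta) = 0.$$

That is,

$$1.7 \sum_{i=1}^{n} a_i K_i(\theta) [y_i - P_i(\theta)] - B(\theta) I(\theta) = 0. \tag{9}$$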
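
The weighted likelihood equation can be solved numerically. The sketch below uses bisection on $[-4, 4]$ and assumes that Lord's bias function takes the commonly cited form $B(\theta) = (1.7 / I^2(\theta)) \sum_i a_i I_i(\theta) [P_i^*(\theta) - 1/2]$; the function names, the bracketing interval, and the example items are hypothetical choices for illustration, not the report's implementation.

```python
import math

D = 1.7


def item_quantities(theta, a, b, c):
    """Return P_i, P*_i, K_i, and the item information I_i at theta."""
    ps = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * ps
    k = ps / p
    info = D ** 2 * a ** 2 * (1.0 - c) * ps * (1.0 - ps) * k
    return p, ps, k, info


def weighted_score(theta, items, y):
    """Score 1.7 * sum a_i K_i (y_i - P_i) minus B(theta) * I(theta)."""
    score = 0.0
    info_total = 0.0
    bias_numerator = 0.0  # sum_i a_i I_i (P*_i - 1/2)
    for (a, b, c), yi in zip(items, y):
        p, ps, k, info = item_quantities(theta, a, b, c)
        score += D * a * k * (yi - p)
        info_total += info
        bias_numerator += a * info * (ps - 0.5)
    # B(theta) * I(theta) simplifies to 1.7 * bias_numerator / I(theta).
    return score - D * bias_numerator / info_total


def wle(items, y, lower=-4.0, upper=4.0, tol=1e-10):
    """Root of the weighted score by bisection; endpoints if no sign change."""
    if weighted_score(lower, items, y) <= 0.0:
        return lower
    if weighted_score(upper, items, y) >= 0.0:
        return upper
    lo, hi = lower, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_score(mid, items, y) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical five-item test and one response pattern.
items = [(0.9, -1.0, 0.0), (1.2, 0.0, 0.0), (1.0, 0.5, 0.0),
         (0.8, 1.0, 0.0), (1.5, -0.5, 0.2)]
y = [1, 1, 0, 0, 1]
theta_w = wle(items, y)
print(theta_w, weighted_score(theta_w, items, y))
```

Bisection assumes a single sign change of the weighted score on the interval; for 3PL items with several roots a grid search over the interval would be a safer first step.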
In reality, both item and ability parameters are unknown. As mentioned in the previous
section, it is a common practice to estimate item parameters first and then treat the estimates
as if they were the true quantities in estimating ability parameters. That is, the MLE of an
ability parameter is obtained by assuming estimated item parameters $\hat{a}_i$, $\hat{b}_i$, and $\hat{c}_i$ are fixed as
substitutes for true parameters. Thus, $\hat{\theta}_m$ satisfies

$$\sum_{i=1}^{n} \hat{a}_i \hat{K}_i(\theta)\,(y_i - \hat{P}_i(\theta)) = 0 \tag{10}$$

instead of (6), where $\hat{P}_i(\theta) = P(\theta; \hat{a}_i, \hat{b}_i, \hat{c}_i)$ and $\hat{K}_i(\theta) = K(\theta; \hat{a}_i, \hat{b}_i, \hat{c}_i)$, while $\hat{\theta}_w$ satisfies

$$1.7 \sum_{i=1}^{n} \hat{a}_i \hat{K}_i(\theta) [y_i - \hat{P}_i(\theta)] - \hat{B}(\theta) \hat{I}(\theta) = 0 \tag{11}$$

instead of (9). The MLE, or the WLE, based on these fixed estimated item parameters will
converge to some value, say $\theta^*$, according to large sample theory under proper regularity
conditions, when the number of items becomes larger and larger. However, $\theta^*$ will not necessarily
be the true ability parameter $\theta$. Thus, the WLE and MLE-LBC methods actually try to reduce
the "bias" against $\theta^*$, not the bias against the true $\theta$, since these methods just aim to reduce the
bias of MLE given item parameters.

In order to correct the bias properly, uncertainty about item parameters or errors of estimated
item parameters should also be considered. Specifically, item parameter estimators can be
regarded as covariates measured with error. Suppose that item parameters are estimated using
a calibration sample with $J$ examinees. Let $\hat{a}_i$, $\hat{b}_i$, $\hat{c}_i$, and $\hat{d}_i$ be the item parameter estimators.
Note that these estimators are related to $J$. The label $J$ is usually suppressed in these and other
related quantities for convenience, unless necessary. Let

$$E(\hat{a}_i) = a_i + \delta_{ai}, \quad E(\hat{b}_i) = b_i + \delta_{bi}, \quad E(\hat{c}_i) = c_i + \delta_{ci},$$

$$\mathrm{Var}(\hat{a}_i) = \sigma_{ai}^2, \quad \mathrm{Var}(\hat{b}_i) = \sigma_{bi}^2, \quad \mathrm{Var}(\hat{c}_i) = \sigma_{ci}^2, \tag{12}$$

$$\mathrm{Cov}(\hat{a}_i, \hat{b}_i) = \sigma_{abi}, \quad \mathrm{Cov}(\hat{b}_i, \hat{c}_i) = \sigma_{bci}, \quad \mathrm{Cov}(\hat{a}_i, \hat{c}_i) = \sigma_{aci}, \tag{13}$$

where $\delta_{ai}$, $\delta_{bi}$, and $\delta_{ci}$ are the biases of corresponding item parameter estimators; $\sigma_{ai}$, $\sigma_{bi}$, and
$\sigma_{ci}$ are the corresponding standard errors; and $\sigma_{abi}$, $\sigma_{bci}$, and $\sigma_{aci}$ are the covariances of item
parameter estimators. In other words, item parameter estimators are measured with error,

$$\hat{a}_i = a_i + \delta_{ai} + \varepsilon_{ai},$$
$$\hat{b}_i = b_i + \delta_{bi} + \varepsilon_{bi},$$
$$\hat{c}_i = c_i + \delta_{ci} + \varepsilon_{ci},$$

where $\{(\varepsilon_{ai}, \varepsilon_{bi}, \varepsilon_{ci})\}$ is an independent sequence of random vectors$^1$ with mean zero and covariance
matrix

$$\Sigma_i = \begin{pmatrix} \sigma_{ai}^2 & \sigma_{abi} & \sigma_{aci} \\ \sigma_{abi} & \sigma_{bi}^2 & \sigma_{bci} \\ \sigma_{aci} & \sigma_{bci} & \sigma_{ci}^2 \end{pmatrix}.$$

The theorem below requires the following regularity conditions. These conditions and their explanations
or justifications will be presented first.


Regularity Conditions

(C0) Item parameters $a_i$ and $b_i$ are uniformly bounded and $c_i$ is bounded away from 1. $\theta$ is a
bounded variable.

(C1) There exists $n_0$ such that for any $n > n_0$,

$$\lim_{J \to \infty} \sigma_n = 0,$$

where

$$\sigma_n = \max_{1 \le i \le n} \{ \sigma_{ai}, \sigma_{bi}, \sigma_{ci}, |\delta_{ai}|, |\delta_{bi}|, |\delta_{ci}| \}.$$

(C2)

$$\lim_{J \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[(\hat{a}_i - a_i)^2] = 0, \quad \lim_{J \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[(\hat{b}_i - b_i)^2] = 0,$$

$$\lim_{J \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[(\hat{a}_i - a_i)(\hat{b}_i - b_i)] = 0, \quad \lim_{J \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[(\hat{c}_i - c_i)^2] = 0,$$

$$\lim_{J \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[(\hat{a}_i - a_i)(\hat{c}_i - c_i)] = 0, \quad \lim_{J \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[(\hat{b}_i - b_i)(\hat{c}_i - c_i)] = 0.$$

(C3) $(\hat{a}_i - a_i)/\sigma_{ai}$, $(\hat{b}_i - b_i)/\sigma_{bi}$, and $(\hat{c}_i - c_i)/\sigma_{ci}$ have uniformly bounded fourth moments.

(C4) For any fixed $\theta$, there exists $c_0(\theta) > 0$ such that

$$\liminf_{n \to \infty} I(\theta)/n > c_0(\theta) > 0.$$

In effect, (C0), which is also required by Lord (1983), holds in all applications. Regularity
Condition (C1) states that the biases and standard errors of item parameter estimators converge
to zero when the calibration sample size tends to infinity, which means that item parameter
estimation results from the calibration sample are reasonable. So is (C2). Regularity Condition
(C3) is a very weak assumption under (C0). Regularity Condition (C4) should hold for all
well-designed tests with reasonable IRT models when $\theta$ is bounded. In fact, it is commonly
assumed. For example, Chang and Stout (1993) also required this condition when proving the
asymptotic posterior normality of the latent ability.

Under the regularity conditions, Zhang et al. (2007) obtained the following asymptotic
expansion results for the MLE and WLE of ability. In the following theorem, notations $o_p(\cdot)$ and
$O_p(\cdot)$ are needed, so that $F_m = G_m + o_p(H_m)$ means that $(F_m - G_m)/H_m$ converges to zero in
probability, and $F_m = O_p(1)$ means that $\{F_m\}$ are bounded in probability, whereas $o(\cdot)$ and $O(\cdot)$
are in the regular sense (see Serfling, 1980). Let

$$L_i(\theta) = \frac{Q_i^*(\theta)}{P_i(\theta)} = \frac{1}{c_i + \exp\{1.7 a_i(\theta - b_i)\}}.$$


Theorem (Zhang, Xie, Song, & Lu, 2007)

Suppose that $\hat{\theta}_m$ is the regular MLE of $\theta$ and satisfies (10) and $\hat{\theta}_w$ is the regular WLE of $\theta$
and satisfies (11), where estimated item parameters are regarded as fixed and known. Assume
that Regularity Conditions (C0)-(C4) hold. Then

$$\hat{\theta}_m = \theta + [J_n(\theta) + Q_n(\theta) + Z_n(\theta)]/I(\theta) + O_p(\,\cdot\,), \tag{14}$$

and

$$\hat{\theta}_w = \theta + [J_n(\theta) + Q_n(\theta) + Z_n(\theta) - B(\theta) I(\theta)]/I(\theta) + o_p(\max\{\,\cdot\,\}), \tag{15}$$
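where $I(\theta)$ is the Fisher test information function given by (7), $B(\theta)$ is given by (8), and

$$J_{n,1}(\theta) = -1.7^2 \sum_{i=1}^{n} (\theta - b_i)(1 - c_i) P_i^*(\theta) Q_i^*(\theta) K_i(\theta) \left\{ 1.7 a_i(\theta - b_i) \left[ \tfrac{1}{2} - P_i^*(\theta) + c_i L_i(\theta) \right] + 1 \right\} (\sigma_{ai}^2 + \delta_{ai}^2),$$

$$J_{n,2}(\theta) = -1.7^3 \sum_{i=1}^{n} a_i^3 (1 - c_i) P_i^*(\theta) Q_i^*(\theta) K_i(\theta) \left[ \tfrac{1}{2} - P_i^*(\theta) + c_i L_i(\theta) \right] (\sigma_{bi}^2 + \delta_{bi}^2),$$

$$J_{n,3}(\theta) = 2 \cdot 1.7^2 \sum_{i=1}^{n} a_i (1 - c_i) P_i^*(\theta) Q_i^*(\theta) K_i(\theta) \left\{ 1.7 a_i(\theta - b_i) \left[ \tfrac{1}{2} - P_i^*(\theta) + c_i L_i(\theta) \right] + 1 \right\} (\sigma_{abi} + \delta_{ai}\delta_{bi}),$$

$$J_{n,4}(\theta) = 1.7 \sum_{i=1,\, c_i > 0}^{n} a_i Q_i^*(\theta) K_i(\theta) L_i(\theta) (\sigma_{ci}^2 + \delta_{ci}^2),$$

$$J_{n,5}(\theta) = 1.7 \sum_{i=1,\, c_i > 0}^{n} Q_i^*(\theta) K_i(\theta) \left\{ 1.7 a_i(\theta - b_i) [1 - 2 c_i L_i(\theta)] - 1 \right\} (\sigma_{aci} + \delta_{ai}\delta_{ci}),$$

$$J_{n,6}(\theta) = -1.7^2 \sum_{i=1,\, c_i > 0}^{n} a_i^2 Q_i^*(\theta) K_i(\theta) [1 - 2 c_i L_i(\theta)] (\sigma_{bci} + \delta_{bi}\delta_{ci}),$$

$$J_n(\theta) = J_{n,1}(\theta) + J_{n,2}(\theta) + J_{n,3}(\theta) + J_{n,4}(\theta) + J_{n,5}(\theta) + J_{n,6}(\theta),$$

$$Q_{n,1}(\theta) = -1.7^2 \sum_{i=1}^{n} a_i (\theta - b_i)(1 - c_i) P_i^*(\theta) Q_i^*(\theta) K_i(\theta)\, \delta_{ai},$$

$$Q_{n,2}(\theta) = 1.7^2 \sum_{i=1}^{n} a_i^2 (1 - c_i) P_i^*(\theta) Q_i^*(\theta) K_i(\theta)\, \delta_{bi},$$

$$Q_{n,3}(\theta) = -1.7 \sum_{i=1,\, c_i > 0}^{n} a_i Q_i^*(\theta) K_i(\theta)\, \delta_{ci},$$

$$Q_n(\theta) = Q_{n,1}(\theta) + Q_{n,2}(\theta) + Q_{n,3}(\theta),$$

$$Z_n(\theta) = 1.7 \sum_{i=1}^{n} a_i K_i(\theta)\,(y_i - P_i(\theta)).$$
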

The theorem provides the error terms or biases of the naive MLE and WLE of ability obtained
by treating estimated item parameters as though they were the true values while they are actually
associated with measurement error. The bias is asymptotically a function of the biases $\{\delta_{ai}, \delta_{bi},
\delta_{ci}\}$ and covariance matrices $\{\Sigma_i\}$ of item parameter estimators. Therefore, given $\{\delta_{ai}, \delta_{bi}, \delta_{ci}\}$
and $\{\Sigma_i\}$, one may calculate the values of biases of the MLE and WLE of ability using (14) and
(15), respectively. One can also determine the range of the bias of the MLE or WLE of ability
if the range of the biases, the variances, and the covariances of item parameter estimators are
known. Thus, one can evaluate the impact of measurement errors of item parameter estimators on
ability estimation and decide whether the naive MLE or WLE is accurate enough in the situation
considered.

Note that the term $Q_n(\theta)/I(\theta)$ represents the component of the bias of the ability estimator
that is caused by $\{\delta_{ai}, \delta_{bi}, \delta_{ci}\}$ only. The term $J_n(\theta)/I(\theta)$ relates to the components of the bias
caused by both $\{\delta_{ai}, \delta_{bi}, \delta_{ci}\}$ and $\{\Sigma_i\}$, while $B(\theta)$, $I(\theta)$, and $Z_n(\theta)$ are independent from any of
those quantities. Notice that $Q_{n,2}(\theta)$ can be rewritten as $Q_{n,2}(\theta) = \sum_{i=1}^{n} I_i(\theta)\,\delta_{bi}$, where $I_i(\theta)$ is
the item information function of item $i$, and $\sum_{i=1}^{n} I_i(\theta) = I(\theta)$. Thus, $Q_{n,2}(\theta)/I(\theta)$ is the weighted
average bias of item difficulty parameters with item information as the weights.

The theorem does not restrict the method of item parameter estimation for a calibration 
sample. Hence, any regular joint MLE, marginal MLE, or Bayesian estimation methods can be 
used to estimate item parameters before applying the theorem. However, the effectiveness of the 
theorem obviously depends on the accuracy of the estimation of the item parameters, the biases, 
the variances, and the covariances of item parameter estimators. 

When applying the theorem to a practical situation, one needs the estimates of $\{\delta_{ai}, \delta_{bi},
\delta_{ci}\}$ and $\{\Sigma_i\}$. Usually, a calibration program provides a set of estimation results either for
parameters $(a_i, b_i, c_i)$ of (1) or for parameters $(a_i, d_i, c_i)$ of (3). For example, PARSCALE
presents an estimate of the covariance matrix of $(\hat{a}_i, \hat{d}_i, \hat{c}_i)$. Using the delta method, given
the estimation results based on (3), one can obtain the results based on (1) and vice versa. If
a calibration program cannot provide accurate enough estimates of $\{\Sigma_i\}$, one can always
calculate the appropriate information matrix or Hessian matrix to obtain estimates of these
covariance matrices. However, estimates of $\delta_{ai}$, $\delta_{bi}$, and $\delta_{ci}$ are typically not directly available
from a calibration program. A Monte Carlo simulation with some replications or the bootstrap
method (Efron, 1982) is needed to obtain estimates of biases of item parameter estimators in
practice. For example, one may use estimated item and ability parameters to generate 100 sets of
simulated response data using a setting as similar as possible to the original data and then calibrate
each set of these data. The average of the discrepancies of newly estimated item parameters from
the original (estimated) item parameters across 100 replications can be used as estimates for the
bias of item parameter estimators. The sample covariance matrix of $(\hat{a}_i, \hat{b}_i, \hat{c}_i)$ based on the 100
replications can also be calculated and used as a substitute for the estimate of the covariance
matrix of $(\hat{a}_i, \hat{b}_i, \hat{c}_i)$. Thus, $B(\theta)$, $I(\theta)$, $J_n(\theta)$, $Q_n(\theta)$, and $Z_n(\theta)$ can be replaced by their estimates,
$\hat{B}(\hat{\theta})$, $\hat{I}(\hat{\theta})$, $\hat{J}_n(\hat{\theta})$, $\hat{Q}_n(\hat{\theta})$, and $\hat{Z}_n(\hat{\theta})$, respectively. Here $\hat{\theta}$ is either $\hat{\theta}_m$ or $\hat{\theta}_w$.
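
The replication-based estimates described above reduce to simple averages once the recalibrated item parameters are available. The sketch below covers only that last step for a single item; the recalibration itself (for example, with PARSCALE) is outside its scope, and the function name and the three replicate values are made up for illustration.

```python
def bias_and_covariance(replicates, reference):
    """Estimate bias and covariance of (a, b, c) estimators for one item.

    replicates : list of (a, b, c) tuples, one per recalibrated data set
    reference  : the original (estimated) item parameters (a, b, c)
    Returns the mean discrepancy from the reference and the sample
    covariance matrix (divisor r - 1) of the replicated estimates.
    """
    r = len(replicates)
    k = len(reference)
    means = [sum(rep[j] for rep in replicates) / r for j in range(k)]
    bias = [means[j] - reference[j] for j in range(k)]
    cov = [[sum((rep[j] - means[j]) * (rep[m] - means[m])
                for rep in replicates) / (r - 1)
            for m in range(k)]
           for j in range(k)]
    return bias, cov


# Three made-up recalibrations of one item (a real study would use ~100).
reference = (1.0, 0.0, 0.2)
replicates = [(1.0, 0.0, 0.2), (1.2, 0.2, 0.2), (1.1, 0.1, 0.2)]
bias, cov = bias_and_covariance(replicates, reference)
print(bias)
print(cov)
```

With 100 replications per item, the returned `bias` plays the role of $(\delta_{ai}, \delta_{bi}, \delta_{ci})$ and `cov` the role of $\Sigma_i$ in the theorem.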

In this paper, we focus on $\hat{\theta}_w$ because the WLE produces slightly better results than the
MLE and MLE-LBC (see Hoijtink & Boomsma, 1995; Zhang, 2005). By (11), we know that
$\hat{Z}_n(\hat{\theta}_w) - \hat{B}(\hat{\theta}_w)\hat{I}(\hat{\theta}_w) = 0$. Thus, we may only need to correct the bias in $[J_n(\theta) + Q_n(\theta)]/I(\theta)$ from
the corresponding WLE to further reduce the bias of the WLE of $\theta$. That is, the bias-corrected
ability parameter estimator is

$$\hat{\theta}_{wc} = \hat{\theta}_w - [\hat{J}_n(\hat{\theta}_w) + \hat{Q}_n(\hat{\theta}_w)]/\hat{I}(\hat{\theta}_w). \tag{16}$$

This estimator is called the corrected weighted likelihood estimator (CWLE), indicating that the
final estimator corrects error terms based on the WLE.
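
A hedged sketch of the correction in (16) follows. The first-order terms are the sums multiplying the biases, and the second-order terms are written in the form obtained by a second-order Taylor expansion of $1.7\,\hat{a}_i \hat{K}_i(\theta)[P_i(\theta) - \hat{P}_i(\theta)]$ in the item parameter errors; that form is this sketch's reading of the theorem and should be checked against Zhang et al. (2007). All function and variable names are hypothetical.

```python
import math

D = 1.7


def cwle_terms(theta, items, biases, covs):
    """Return (J_n, Q_n, I) evaluated at theta.

    items  : list of (a, b, c) item parameter estimates
    biases : list of (delta_a, delta_b, delta_c) estimated biases
    covs   : list of 3x3 covariance matrices, rows/columns ordered (a, b, c)
    """
    j_n = q_n = info = 0.0
    for (a, b, c), (da, db, dc), s in zip(items, biases, covs):
        z = D * a * (theta - b)
        ps = 1.0 / (1.0 + math.exp(-z))   # P*_i
        qs = 1.0 - ps                     # Q*_i
        p = c + (1.0 - c) * ps            # P_i
        k_i = ps / p                      # K_i
        l_i = qs / p                      # L_i
        w = (1.0 - c) * ps * qs * k_i
        h = 0.5 - ps + c * l_i
        info += D ** 2 * a ** 2 * w
        # First-order terms driven by the biases of a and b.
        q_n += -D ** 2 * a * (theta - b) * w * da + D ** 2 * a ** 2 * w * db
        # Second-order terms for (a, a), (b, b), and (a, b).
        j_n += -D ** 2 * (theta - b) * w * (z * h + 1.0) * (s[0][0] + da ** 2)
        j_n += -D ** 3 * a ** 3 * w * h * (s[1][1] + db ** 2)
        j_n += 2.0 * D ** 2 * a * w * (z * h + 1.0) * (s[0][1] + da * db)
        if c > 0.0:  # terms that involve the guessing parameter
            g = 1.0 - 2.0 * c * l_i
            q_n += -D * a * qs * k_i * dc
            j_n += D * a * qs * k_i * l_i * (s[2][2] + dc ** 2)
            j_n += D * qs * k_i * (z * g - 1.0) * (s[0][2] + da * dc)
            j_n += -D ** 2 * a ** 2 * qs * k_i * g * (s[1][2] + db * dc)
    return j_n, q_n, info


def cwle(theta_w, items, biases, covs):
    """Corrected WLE: theta_w - (J_n + Q_n) / I, all evaluated at theta_w."""
    j_n, q_n, info = cwle_terms(theta_w, items, biases, covs)
    return theta_w - (j_n + q_n) / info


# One hypothetical item whose difficulty is overestimated by 0.05.
items = [(1.0, 0.0, 0.2)]
zero_cov = [[0.0, 0.0, 0.0] for _ in range(3)]
print(cwle(0.5, items, [(0.0, 0.05, 0.0)], [zero_cov]))
```

When all biases and covariances are zero the correction vanishes and the CWLE coincides with the WLE, which is a quick sanity check for any implementation.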

3 A Simulation Study 

A simulation study was conducted to compare MLE, WLE, and CWLE. Specifically, 
the study attempted to determine which method produces the best ability-estimation result. The 
estimated item parameters from the 1998 National Assessment of Educational Progress (NAEP) 
grade 4 reading assessment were used to generate simulated response data (Allen, Donoghue, & 
Schoeps, 2001). Among 60 items used in the simulation study, there are 26 2PL items and 34 3PL 
items. These item parameters are presented in Table 1. 

The simulation study has two stages. In the first stage, item parameters are estimated from 
a simulated calibration sample. These estimated item parameters are used as fixed to estimate 
individual ability parameters of a simulated target sample in the second stage. 
Table 1

Item Parameters Used in the Simulation Study

Item        a        b        c    Item        a        b        c
   1    0.623   -0.872    0.000      31    1.342   -0.457    0.175
   2    0.920    1.008    0.000      32    1.110    0.148    0.244
   3    1.052    1.009    0.000      33    1.228    0.259    0.247
   4    0.754    0.015    0.000      34    0.951   -0.864    0.319
   5    0.763   -0.284    0.000      35    1.472    1.204    0.167
   6    1.025    0.107    0.000      36    1.859    0.213    0.265
   7    0.647   -1.008    0.000      37    1.133    0.916    0.297
   8    0.520   -1.425    0.000      38    1.374    0.307    0.269
   9    0.757   -0.630    0.000      39    0.504   -0.932    0.247
  10    0.832    1.118    0.000      40    1.415    0.891    0.271
  11    1.123    1.057    0.000      41    2.303    0.609    0.418
  12    0.814    0.306    0.000      42    0.966   -1.318    0.244
  13    0.506   -1.272    0.000      43    1.029    0.327    0.300
  14    0.269   -0.904    0.000      44    0.721   -1.193    0.247
  15    1.172    0.645    0.000      45    0.941    0.401    0.264
  16    0.877   -0.523    0.000      46    0.793    0.642    0.247
  17    0.761   -1.242    0.000      47    1.032    0.507    0.248
  18    0.619   -1.113    0.000      48    0.533   -0.835    0.218
  19    1.154    0.645    0.000      49    1.203    0.257    0.165
  20    1.536    1.192    0.000      50    1.104   -0.155    0.247
  21    0.597    1.341    0.000      51    1.464    0.774    0.138
  22    0.970    0.906    0.000      52    2.300    0.416    0.264
  23    1.086   -0.060    0.000      53    0.562   -0.073    0.237
  24    0.795   -0.238    0.000      54    0.883   -1.015    0.310
  25    0.838   -0.076    0.000      55    1.261    1.084    0.206
  26    1.031   -0.310    0.000      56    0.597   -0.206    0.156
  27    1.506   -0.495    0.215      57    0.938   -1.691    0.294
  28    0.607    0.712    0.251      58    1.414   -0.608    0.275
  29    1.288    0.554    0.190      59    1.185   -0.590    0.312
  30    1.798   -0.899    0.248      60    0.579   -0.688    0.276

Note. Data from the 1998 NAEP Grade 4 Reading Assessment.
The numbers of examinees in simulated calibration samples are 250, 500, and 1,000.
Examinees' ability parameters were independently generated from a standard normal distribution.
Based on these ability parameters and the item parameters shown in Table 1, 100 sets (for 100
replications) of calibration response data were generated under the IRT model for each of the three
sample sizes. Each simulated data set was used to estimate item parameters separately. In this
study, a NAEP version of PARSCALE (Allen et al., 1999; Muraki & Bock, 1991) was used to
estimate item parameters. Tables 2-4 present the bias of estimated item parameters based on
100 replications for sample sizes 250, 500, and 1,000, respectively. The covariance matrices are
not reported here because they are too large. Each of these 300 sets of estimated item
parameters will be used as fixed and known when estimating ability parameters in the next stage.
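
The data-generation step of the first stage can be sketched as follows. This is an illustration only: the seed, the function names, and the use of three items instead of all 60 are assumptions, and the calibration itself (PARSCALE in the study) is not reproduced here.

```python
import math
import random

D = 1.7


def prob_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Draw one 0/1 response per examinee and item from the 3PL model."""
    data = []
    for theta in thetas:
        row = [1 if rng.random() < prob_correct(theta, a, b, c) else 0
               for a, b, c in items]
        data.append(row)
    return data


rng = random.Random(2024)  # arbitrary fixed seed for reproducibility
# Items 1, 27, and 41 of Table 1; the study used all 60 items.
items = [(0.623, -0.872, 0.0), (1.506, -0.495, 0.215), (2.303, 0.609, 0.418)]
# One calibration sample of 250 examinees with standard normal abilities.
calibration_thetas = [rng.gauss(0.0, 1.0) for _ in range(250)]
calibration_data = simulate_responses(calibration_thetas, items, rng)
print(len(calibration_data), calibration_data[0])
```

Repeating the last three lines 100 times per sample size, with a fresh draw of abilities each time, would give the replicated calibration data sets described above.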

In the second stage, the MLE, WLE, and CWLE methods are used to estimate ability
parameters. Since the performance of MLE, WLE, or CWLE might be different at the different
ability levels, we evaluated the ability-estimation accuracy at several ability levels. That is, we
compared the results from the three ability-estimation methods to determine which method gives
the best ability estimation result at these ability levels. Specifically, we chose 13 ability levels
in this simulation. They are -3.0, -2.5, ..., 2.5, and 3.0. Using these ability values and the
item parameters in Table 1, simulated response data were generated again under the IRT model.
Regarding the estimated item parameters from a calibration sample in the first stage as fixed and
known, $\hat{\theta}_m$ and $\hat{\theta}_w$ were obtained for each examinee in the newly simulated response data. Then,
the biases, variances, and covariances of estimated item parameters obtained in the first stage
were used to calculate the bias-correction term in (16), $[\hat{J}_n(\hat{\theta}_w) + \hat{Q}_n(\hat{\theta}_w)]/\hat{I}(\hat{\theta}_w)$, so as to obtain
$\hat{\theta}_{wc}$. For each set of estimated item parameters, the process was repeated 100 times.

The measurement precision of $\hat{\theta}_m$, $\hat{\theta}_w$, and $\hat{\theta}_{wc}$ was evaluated by comparing the conditional
bias and the RMSE at each ability level. RMSE is the square root of the average of the squared
deviations of estimated parameters from the true one. Tables 5-7 and Figures 1-3 show the biases
and RMSEs of $\hat{\theta}_m$, $\hat{\theta}_w$, and $\hat{\theta}_{wc}$ at each of the 13 ability levels when item parameters are estimated
from calibration samples of 250, 500, and 1,000 examinees, respectively.
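
The two evaluation criteria can be written out directly. The sketch below computes the conditional bias and RMSE at one true ability level; the function name and the three example estimates are hypothetical.

```python
import math


def bias_and_rmse(estimates, true_theta):
    """Conditional bias and RMSE of ability estimates at one ability level.

    bias = average of (estimate - true value)
    RMSE = square root of the average squared deviation from the true value
    """
    n = len(estimates)
    bias = sum(e - true_theta for e in estimates) / n
    rmse = math.sqrt(sum((e - true_theta) ** 2 for e in estimates) / n)
    return bias, rmse


# Made-up estimates for examinees whose true ability is 1.0.
bias, rmse = bias_and_rmse([0.9, 1.1, 1.3], 1.0)
print(bias, rmse)
```

Because the squared RMSE equals the squared bias plus the variance of the estimates around their mean, an estimator can have a smaller bias and still a larger RMSE, which is worth keeping in mind when reading the bias and RMSE columns side by side.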
Table 2

Bias of Estimated Item Parameters With
Calibration Sample Size 250, Based on 100 Replications

Item         a         b         c    Item         a         b         c
   1    0.0225   -0.0468    0.0000      31    0.0458   -0.0051    0.0330
   2   -0.0051   -0.0341    0.0000      32   -0.0429   -0.0958   -0.0162
   3   -0.0269   -0.0122    0.0000      33   -0.0729   -0.0894   -0.0238
   4    0.0309   -0.0633    0.0000      34   -0.0097   -0.2190   -0.0920
   5    0.0159   -0.0375    0.0000      35   -0.1097    0.0286    0.0132
   6    0.0029   -0.0437    0.0000      36   -0.2607   -0.1298   -0.0367
   7    0.0440   -0.0343    0.0000      37   -0.1691   -0.1687   -0.0528
   8    0.0531    0.0427    0.0000      38   -0.1416   -0.1409   -0.0372
   9    0.0272   -0.0594    0.0000      39    0.0865    0.0689   -0.0149
  10    0.0173   -0.0293    0.0000      40   -0.2276   -0.0913   -0.0334
  11   -0.0060   -0.0287    0.0000      41   -0.9640   -0.3142   -0.1201
  12    0.0299   -0.0488    0.0000      42    0.0271   -0.0589   -0.0189
  13    0.0532    0.0286    0.0000      43   -0.0962   -0.2028   -0.0646
  14    0.0857    0.1224    0.0000      44    0.0444   -0.0651   -0.0201
  15   -0.0321   -0.0431    0.0000      45   -0.0303   -0.1490   -0.0337
  16    0.0148   -0.0428    0.0000      46    0.0021   -0.0784   -0.0159
  17    0.0633   -0.0024    0.0000      47   -0.0092   -0.0763   -0.0172
  18    0.0407   -0.0471    0.0000      48    0.0661    0.0435    0.0144
  19    0.0059   -0.0556    0.0000      49    0.0641   -0.0092    0.0331
  20   -0.0643   -0.0098    0.0000      50   -0.0173   -0.1002   -0.0192
  21    0.0333   -0.0598    0.0000      51   -0.0161   -0.0055    0.0301
  22    0.0075   -0.0250    0.0000      52   -0.4895   -0.1168   -0.0377
  23   -0.0034   -0.0534    0.0000      53    0.0671   -0.0647   -0.0027
  24    0.0105   -0.0681    0.0000      54   -0.0231   -0.2349   -0.0822
  25    0.0179   -0.0542    0.0000      55   -0.0725   -0.0435   -0.0052
  26    0.0175   -0.0554    0.0000      56    0.0997    0.1442    0.0724
  27   -0.0040   -0.0713   -0.0003      57   -0.0140   -0.2004   -0.0681
  28    0.0866   -0.0958   -0.0127      58   -0.0977   -0.1343   -0.0505
  29   -0.0025   -0.0218    0.0104      59   -0.0890   -0.2341   -0.0837
  30   -0.1218   -0.1043   -0.0291      60    0.0521   -0.1053   -0.0440
Table 3

Bias of Estimated Item Parameters With
Calibration Sample Size 500, Based on 100 Replications

Item         a         b         c    Item         a         b         c
   1    0.0198   -0.0388    0.0000      31    0.0851    0.0180    0.0394
   2   -0.0074   -0.0307    0.0000      32    0.0084   -0.0452   -0.0007
   3   -0.0065   -0.0281    0.0000      33   -0.0328   -0.0624   -0.0096
   4    0.0224   -0.0562    0.0000      34   -0.0346   -0.1923   -0.0752
   5    0.0100   -0.0322    0.0000      35   -0.0627    0.0180    0.0115
   6   -0.0051   -0.0356    0.0000      36   -0.1362   -0.0887   -0.0237
   7    0.0265   -0.0335    0.0000      37   -0.0959   -0.1177   -0.0288
   8    0.0232    0.0178    0.0000      38   -0.0955   -0.0986   -0.0239
   9    0.0194   -0.0507    0.0000      39    0.0568    0.0543    0.0009
  10    0.0079   -0.0297    0.0000      40   -0.1474   -0.0616   -0.0186
  11   -0.0004   -0.0285    0.0000      41   -0.6196   -0.1671   -0.0617
  12    0.0214   -0.0359    0.0000      42    0.0125   -0.0379   -0.0056
  13    0.0367    0.0105    0.0000      43   -0.0631   -0.1243   -0.0441
  14    0.0499    0.0597    0.0000      44    0.0206   -0.0255   -0.0056
  15   -0.0217   -0.0368    0.0000      45   -0.0167   -0.1042   -0.0211
  16    0.0081   -0.0361    0.0000      46    0.0100   -0.0484   -0.0051
  17    0.0228   -0.0344    0.0000      47   -0.0030   -0.0491   -0.0078
  18    0.0238   -0.0269    0.0000      48    0.0525    0.0701    0.0285
  19    0.0177   -0.0385    0.0000      49    0.0816    0.0053    0.0298
  20   -0.0234   -0.0191    0.0000      50   -0.0322   -0.0672   -0.0092
  21    0.0191   -0.0592    0.0000      51    0.0255   -0.0025    0.0257
  22    0.0079   -0.0231    0.0000      52   -0.3427   -0.0794   -0.0231
  23    0.0013   -0.0431    0.0000      53    0.0427   -0.0052    0.0115
  24    0.0114   -0.0467    0.0000      54   -0.0358   -0.1956   -0.0681
  25    0.0078   -0.0440    0.0000      55   -0.0121   -0.0165    0.0016
  26    0.0130   -0.0412    0.0000      56    0.0878    0.1686    0.0809
  27    0.0334   -0.0426    0.0077      57   -0.0111   -0.1504   -0.0548
  28    0.0553   -0.0319    0.0019      58   -0.0568   -0.0839   -0.0306
  29    0.0378   -0.0182    0.0131      59   -0.0811   -0.1833   -0.0612
  30   -0.0566   -0.0742   -0.0149      60    0.0334   -0.0872   -0.0283
Table 4

Bias of Estimated Item Parameters With
Calibration Sample Size 1,000, Based on 100 Replications

Item         a         b         c    Item         a         b         c
   1    0.0132   -0.0342    0.0000      31    0.0750    0.0020    0.0357
   2   -0.0007   -0.0413    0.0000      32    0.0078   -0.0477   -0.0005
   3   -0.0055   -0.0388    0.0000      33   -0.0213   -0.0454   -0.0027
   4    0.0072   -0.0498    0.0000      34   -0.0302   -0.1765   -0.0667
   5    0.0050   -0.0317    0.0000      35   -0.0065   -0.0145    0.0060
   6   -0.0059   -0.0437    0.0000      36   -0.0674   -0.0719   -0.0125
   7    0.0172   -0.0290    0.0000      37   -0.0587   -0.0871   -0.0156
   8    0.0151    0.0006    0.0000      38   -0.0454   -0.0801   -0.0182
   9    0.0176   -0.0424    0.0000      39    0.0367    0.0403    0.0067
  10    0.0053   -0.0347    0.0000      40   -0.0768   -0.0557   -0.0110
  11   -0.0068   -0.0350    0.0000      41   -0.3395   -0.0972   -0.0293
  12    0.0074   -0.0367    0.0000      42    0.0096   -0.0320   -0.0004
  13    0.0162   -0.0181    0.0000      43   -0.0546   -0.1030   -0.0337
  14    0.0317    0.0147    0.0000      44    0.0117   -0.0314   -0.0002
  15   -0.0081   -0.0421    0.0000      45   -0.0055   -0.0833   -0.0127
  16    0.0042   -0.0373    0.0000      46    0.0113   -0.0500   -0.0011
  17    0.0165   -0.0322    0.0000      47   -0.0071   -0.0436   -0.0045
  18    0.0116   -0.0351    0.0000      48    0.0423    0.0622    0.0337
  19    0.0101   -0.0451    0.0000      49    0.0589   -0.0073    0.0206
  20   -0.0029   -0.0363    0.0000      50   -0.0044   -0.0575   -0.0064
  21    0.0122   -0.0452    0.0000      51    0.0277   -0.0318    0.0153
  22    0.0052   -0.0338    0.0000      52   -0.1896   -0.0680   -0.0138
  23    0.0112   -0.0381    0.0000      53    0.0321   -0.0089    0.0147
  24    0.0062   -0.0397    0.0000      54   -0.0434   -0.1845   -0.0631
  25    0.0027   -0.0415    0.0000      55    0.0022   -0.0332    0.0007
  26    0.0147   -0.0412    0.0000      56    0.0835    0.1716    0.0789
  27    0.0634   -0.0285    0.0087      57   -0.0101   -0.1329   -0.0498
  28    0.0389   -0.0122    0.0084      58   -0.0317   -0.0802   -0.0202
  29    0.0602   -0.0269    0.0098      59   -0.0632   -0.1485   -0.0477
  30   -0.0646   -0.0673   -0.0114      60    0.0229   -0.0792   -0.0196
Table 5

Bias and RMSE of Ability Estimates When Item Parameters
Are Estimated With Calibration Sample Size 250

                        Bias                            RMSE
Ability       MLE       WLE      CWLE       MLE       WLE      CWLE
   -3.0   -0.0428    0.2309    0.0756    0.7396    0.6978    0.7483
   -2.5   -0.1206    0.0919   -0.0160    0.7432    0.6254    0.6990
   -2.0   -0.0749    0.0490    0.0000    0.6056    0.5057    0.5662
   -1.5   -0.0666   -0.0064   -0.0037    0.4396    0.3871    0.4281
   -1.0   -0.0555   -0.0306    0.0117    0.3239    0.3027    0.3211
   -0.5   -0.0595   -0.0514    0.0128    0.2627    0.2563    0.2595
    0.0   -0.0735   -0.0716   -0.0012    0.2347    0.2323    0.2196
    0.5   -0.0695   -0.0743   -0.0184    0.2301    0.2283    0.2107
    1.0   -0.0341   -0.0514   -0.0105    0.2477    0.2431    0.2332
    1.5    0.0110   -0.0366   -0.0112    0.3369    0.3092    0.2967
    2.0    0.1080   -0.0191   -0.0169    0.5420    0.4360    0.4148
    2.5    0.2134   -0.0383   -0.0607    0.7094    0.5394    0.5145
    3.0    0.2117   -0.1541   -0.1960    0.7176    0.5818    0.5691




Table 6

Bias and RMSE of Ability Estimates When Item Parameters
Are Estimated With Calibration Sample Size 500

                    Bias                         RMSE
Ability      MLE      WLE     CWLE        MLE      WLE     CWLE
-3.0     -0.0931   0.1805   0.0870     0.7282   0.6793   0.7204
-2.5     -0.1682   0.0498  -0.0105     0.7475   0.6208   0.6741
-2.0     -0.1105   0.0216   0.0010     0.6204   0.5062   0.5467
-1.5     -0.0846  -0.0203  -0.0084     0.4471   0.3887   0.4131
-1.0     -0.0573  -0.0305   0.0037     0.3231   0.3010   0.3098
-0.5     -0.0491  -0.0404   0.0053     0.2567   0.2500   0.2516
0.0      -0.0553  -0.0527  -0.0002     0.2234   0.2209   0.2150
0.5      -0.0510  -0.0552  -0.0099     0.2157   0.2132   0.2017
1.0      -0.0222  -0.0392  -0.0038     0.2319   0.2268   0.2208
1.5       0.0134  -0.0338  -0.0043     0.3202   0.2908   0.2856
2.0       0.1010  -0.0277  -0.0044     0.5254   0.4129   0.4075
2.5       0.2029  -0.0581  -0.0397     0.6985   0.5137   0.5083
3.0       0.2021  -0.1838  -0.1677     0.7110   0.5597   0.5527
Table 7

Bias and RMSE of Ability Estimates When Item Parameters
Are Estimated With Calibration Sample Size 1,000

                    Bias                         RMSE
Ability      MLE      WLE     CWLE        MLE      WLE     CWLE
-3.0     -0.1193   0.1554   0.0986     0.7229   0.6698   0.7055
-2.5     -0.1913   0.0281  -0.0025     0.7497   0.6196   0.6620
-2.0     -0.1299   0.0048   0.0051     0.6253   0.5060   0.5372
-1.5     -0.0984  -0.0323  -0.0082     0.4546   0.3922   0.4087
-1.0     -0.0643  -0.0367   0.0016     0.3245   0.3009   0.3044
-0.5     -0.0510  -0.0421   0.0017     0.2568   0.2499   0.2489
0.0      -0.0525  -0.0494  -0.0005     0.2217   0.2194   0.2151
0.5      -0.0491  -0.0530  -0.0061     0.2101   0.2071   0.1982
1.0      -0.0261  -0.0432  -0.0012     0.2243   0.2197   0.2144
1.5       0.0050  -0.0422  -0.0015     0.3109   0.2821   0.2787
2.0       0.0879  -0.0418  -0.0016     0.5150   0.4022   0.4003
2.5       0.1919  -0.0761  -0.0354     0.6935   0.5004   0.4979
3.0       0.1940  -0.2055  -0.1632     0.7103   0.5507   0.5395


Figures 1-3 illustrate that the MLE has negative bias at low ability levels and positive bias 
at high ability levels (i.e., outward bias), while the bias of the WLE has the opposite pattern. The 
figures also clearly show that the CWLE successfully reduced the bias in the cases considered here. 
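
As a reading aid for Tables 5-7, the following is a minimal sketch of how bias and RMSE are conventionally computed from replicated ability estimates at one true ability level. It assumes the usual definitions (mean error and root mean squared error); the function name and the replicate values are hypothetical and are not taken from the study.

```python
import math

def bias_and_rmse(estimates, true_theta):
    """Return (bias, RMSE) of replicated ability estimates at one true ability."""
    n = len(estimates)
    errors = [est - true_theta for est in estimates]
    bias = sum(errors) / n                              # mean signed error
    rmse = math.sqrt(sum(e * e for e in errors) / n)    # root mean squared error
    return bias, rmse

# Hypothetical replicated estimates at true ability 1.0 (illustrative numbers only).
replicates = [0.8, 1.1, 1.3, 0.9, 0.9]
bias, rmse = bias_and_rmse(replicates, 1.0)
print(f"bias = {bias:.4f}, RMSE = {rmse:.4f}")
```

Because the RMSE combines bias and variance, an estimator can have a smaller bias and still a larger RMSE, which is the pattern seen for the CWLE at low ability levels.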

Note that the ability-estimation program used in this study searches for the maximum values 
of (weighted) likelihood functions only on [-4, 4]. This restriction may cause the irregular results 
at the extreme ability levels considered in this paper, especially at Level -3. 
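
The effect of such a restriction can be illustrated with a small sketch. The code below is not the program used in the study; it is a plain grid search for the MLE under a three-parameter logistic model, with an assumed scaling constant D = 1.7 and hypothetical item parameters. It shows only that a response pattern whose likelihood is monotone in ability is assigned the boundary value.

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    """3PL item response function; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def bounded_mle(items, responses, lower=-4.0, upper=4.0, steps=800):
    """Grid-search maximizer of the log-likelihood restricted to [lower, upper]."""
    grid = [lower + (upper - lower) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (a, b, c) item parameters, for illustration only.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
theta_all_correct = bounded_mle(items, [1, 1, 1])  # likelihood is increasing in theta
theta_mixed = bounded_mle(items, [1, 1, 0])        # maximum lies inside the interval
print(theta_all_correct, theta_mixed)
```

Estimates that pile up at a boundary are truncated toward the center of the scale, which is one way a bounded search could distort bias and RMSE at extreme true abilities.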

Although it reduces the bias, the CWLE does not always reduce the RMSE at the ability levels 
considered here. It even produces slightly larger RMSEs than the WLE when the true ability is on the 
left side of the ability scale (see Tables 5-7 or Figures 1-3). The main reason may be that the error 
terms in (16) were evaluated at the weighted likelihood estimates instead of at 
the true ability values. 
Figure 1. Bias and RMSE of ability estimates when item parameters are estimated with 
calibration sample size 250. 

Figure 2. Bias and RMSE of ability estimates when item parameters are estimated with 
calibration sample size 500. 

Figure 3. Bias and RMSE of ability estimates when item parameters are estimated with 
calibration sample size 1,000. 

Note that the average of the difficulty parameters in Table 1 is around zero (-0.0401). 
Thus, $\hat{\theta}_w - \theta$ is larger when $\theta$ is away from zero and smaller when $\theta$ is near zero. Hence, noise 
is added when we try to correct the bias by substituting $[J_n(\hat{\theta}_w) + Q_n(\hat{\theta}_w)]/I(\hat{\theta}_w)$ for 
$[J_n(\theta) + Q_n(\theta)]/I(\theta)$, and the noise may not be small when $\theta$ is far from the average of 
the difficulty parameters in IRT models. When $\theta$ is far from the average of the difficulty 
parameters in IRT models, $\hat{\theta}_m$ or $\hat{\theta}_w$ typically has large bias or RMSE; that is, $\hat{\theta}_w$ may be too far 
from $\theta$, which violates the assumption $\sqrt{n}(\hat{\theta}_w - \theta) = O_p(1)$. To confirm this, we re-evaluated 
the error terms in (16) at the true ability values rather than the estimated ones (but the item parameters and 
their covariance matrixes were still the estimated ones). That is, 

$$\hat{\theta}_{wta} = \hat{\theta}_w - [J_n(\theta) + Q_n(\theta)]/I(\theta).$$

Though it is not practical, $\hat{\theta}_{wta}$ did produce slightly, but uniformly, smaller RMSEs than $\hat{\theta}_w$ 
produced. In summary, the simulation study suggests that the CWLE is a useful alternative to 
$\hat{\theta}_m$ or $\hat{\theta}_w$, especially when $\theta$ is within the range of the difficulty parameters. 

4 Discussion 

The accuracy of ability estimates is very important because estimated ability scores are 
the major measurement output of a test that is analyzed using IRT models. This paper tries 
to reduce the bias of the WLE of ability caused by treating item parameters estimated from a 
calibration sample as if they were true. Based on the results of the simulation study, the CWLE is 
effective in reducing the bias of the WLE in the cases considered here. However, the CWLE does not 
reduce the RMSE of the ability estimator when the true ability values are far below 
the average of the item difficulty parameters in a test. This weakness is not relevant in computerized 
adaptive testing (CAT), since CAT always tries to match the item difficulty level with the examinee’s 
ability level. Therefore, in CAT, the CWLE can reduce not only the bias of the ability estimator but 
also the RMSE. In this study, we tested the CWLE method only in limited cases. To determine the 
capacity and limitations of the CWLE, further theoretical and simulation studies are needed. 

It is important to note that the effectiveness of CWLE depends on the calibration program 
(software) used to estimate item parameters and the covariance matrixes of estimated item 
parameters. Hence, one should check if a calibration program can produce reasonable estimates 
of covariance matrixes before applying the CWLE method. The CWLE method is effective only 
when both the bias and the covariance matrixes of estimated item parameters are well estimated. 
As discussed in Section 3, the bias of estimated item parameters, which is typically not directly 
available from a calibration program, can be obtained by a Monte-Carlo simulation with some 
replications or the bootstrap method. As a by-product, one can also obtain sample covariance 
matrixes of estimated item parameters based on the bootstrap method. These covariance matrixes 
can be used to check the accuracy of the covariance matrixes provided by the calibration program 
and/or applied directly to (16). 
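
The bootstrap idea can be sketched generically as follows. This is only an illustration under stated assumptions: examinees are resampled with replacement, the estimator is re-run on each resample, and the bias and sample covariance matrix are computed from the bootstrap draws. The function `crude_difficulty` is a hypothetical stand-in for a real calibration program (it is not an IRT calibration), and the response data are simulated for the example.

```python
import math
import random

def crude_difficulty(data):
    """Hypothetical stand-in for calibration: negative logit of proportion correct."""
    n, k = len(data), len(data[0])
    out = []
    for j in range(k):
        p = sum(row[j] for row in data) / n
        p = min(max(p, 0.5 / n), 1.0 - 0.5 / n)  # keep the logit finite
        out.append(-math.log(p / (1.0 - p)))
    return out

def bootstrap_bias_and_cov(data, estimator, n_boot=200, seed=0):
    """Bootstrap estimates of the bias and covariance matrix of an estimator."""
    rng = random.Random(seed)
    original = estimator(data)
    k = len(original)
    draws = []
    for _ in range(n_boot):
        # Resample examinees (rows) with replacement and re-estimate.
        resample = [rng.choice(data) for _ in range(len(data))]
        draws.append(estimator(resample))
    means = [sum(d[j] for d in draws) / n_boot for j in range(k)]
    bias = [means[j] - original[j] for j in range(k)]
    cov = [[sum((d[i] - means[i]) * (d[j] - means[j]) for d in draws) / (n_boot - 1)
            for j in range(k)] for i in range(k)]
    return bias, cov

# Simulated responses of 250 hypothetical examinees to 3 items.
gen = random.Random(42)
probs = [0.7, 0.5, 0.3]
responses = [[1 if gen.random() < p else 0 for p in probs] for _ in range(250)]
bias, cov = bootstrap_bias_and_cov(responses, crude_difficulty)
print(bias)
print([cov[i][i] for i in range(3)])
```

In an actual application, the estimator argument would wrap the calibration program itself, and the resulting covariance matrixes could be compared with those the program reports.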

Another way to deal with the uncertainty about item parameters is to make use of the 
expected response functions (ERFs; Lewis, 1985, 2001; Mislevy et al., 1994). An ERF is the 
expectation of an IRF with respect to the posterior distributions of item parameters. This 
Bayesian approach takes the uncertainty about item parameters into account by substituting 
ERFs for IRFs in the likelihood function. Lee and Zhang (2007) compared this method with 
the CWLE in a simulation study. For details, see Lee and Zhang (2007). 
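
To make the definition concrete, the sketch below approximates an ERF by Monte Carlo, averaging a 3PL IRF over draws of the item parameters. The posterior used here (independent, truncated normal draws), the scaling constant D = 1.7, and all numerical values are assumptions for illustration; they are not the specifications used by Lewis, Mislevy et al., or Lee and Zhang.

```python
import math
import random

def irf_3pl(theta, a, b, c, D=1.7):
    """3PL item response function; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def expected_response_function(theta, posterior_draws):
    """Monte Carlo ERF: average of the IRF over draws of (a, b, c)."""
    total = sum(irf_3pl(theta, a, b, c) for a, b, c in posterior_draws)
    return total / len(posterior_draws)

# Hypothetical posterior draws around the point estimates (a, b, c) = (1.0, 0.0, 0.2).
rng = random.Random(1)
draws = []
for _ in range(2000):
    a = max(0.05, rng.gauss(1.0, 0.15))            # keep discrimination positive
    b = rng.gauss(0.0, 0.2)
    c = min(0.5, max(0.0, rng.gauss(0.2, 0.03)))   # keep guessing in [0, 0.5]
    draws.append((a, b, c))

erf_value = expected_response_function(2.0, draws)
irf_value = irf_3pl(2.0, 1.0, 0.0, 0.2)
print(f"ERF = {erf_value:.4f}, IRF at point estimates = {irf_value:.4f}")
```

When the posterior collapses to a single point, the ERF reduces to the ordinary IRF, so the two approaches agree when item parameter uncertainty is negligible.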
Notes 


1 In a Bayesian setting, item parameters are usually assumed to be independent between items, 
that is, $\{(a_i, b_i, c_i)\}$ is an independent sequence of random vectors (see Lewis, 2001). Lewis argued 
that this is almost a necessary condition. In practice, only the covariances of item parameter 
estimators within an item are available, and the covariances of item parameter estimators between 
items are zero. Thus, it is not too unreasonable to assume that $\{(\varepsilon_{a_i}, \varepsilon_{b_i}, \varepsilon_{c_i})\}$ is an independent 
sequence of random vectors. 